Non-record: 11L Int5 QAT + Score-First TTT — val_bpb 1.1326 (15.51 MB)#861

Open
JoeProAI wants to merge 3 commits into openai:main from JoeProAI:submission/joeproai-11l-int5-ttt-1.1326

Conversation

@JoeProAI JoeProAI commented Mar 26, 2026

11L U-Net + Int5 QAT + Score-First Legal TTT

3-seed mean val_bpb: 1.13391 (std 0.00153) | 15.51 MB (16,265,723 bytes) | 8xH100 (~37 min)


What's different

Built on the PR #549 stack. Key additions:

  • Int5 QAT — weights quantized to [-15, 15] per-row (stored int8 + float16 scale). Tighter than int6, better zstd compression ratio.
  • Score-first TTT — AdamW on MLP-only params (up_proj, down_proj, gate_proj, scale). lr=0.0004, 1 epoch. Order: score chunk first, then adapt. Legal per the PR #461 recipe (Non-record: 11L Depth Recurrence + High-Yield Legal TTT, 1.14458 BPB).
  • MLP_HIDDEN=1536 — reduced from 1792 to fit artifact under 16 MB with int5.
  • 15% weight pruning — zero smallest weights pre-quantization for better zstd compression.
  • Bigram hash embedding — 4096 buckets, 128-dim, added to token embeddings.
  • XSA on all 11 layers — full U-Net cross-layer shared attention.
  • Warmdown 6000 steps — longer QAT phase for better weight clustering near int5 boundaries.
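
The int5 QAT and pruning levers above can be sketched together. This is a hypothetical illustration, not the PR's actual code: `int5_quantize` and `prune_pct` are names invented here. It zeros the smallest-magnitude 15% of weights (long zero runs compress well under zstd), then quantizes each row to [-15, 15] with a per-row float16 scale, storing values as int8:

```python
import numpy as np

def int5_quantize(w: np.ndarray, prune_pct: float = 0.15):
    """Per-row int5 quantization with magnitude pruning (illustrative sketch)."""
    w = w.astype(np.float32).copy()
    # Prune: zero the smallest-magnitude weights so zstd sees long zero runs.
    k = int(prune_pct * w.size)
    if k > 0:
        thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
        w[np.abs(w) <= thresh] = 0.0
    # Per-row scale so the largest |weight| in each row maps to 15.
    scale = (np.abs(w).max(axis=1, keepdims=True) / 15.0).astype(np.float16)
    scale[scale == 0] = 1.0  # avoid divide-by-zero on all-zero rows
    q = np.clip(np.round(w / scale.astype(np.float32)), -15, 15).astype(np.int8)
    return q, scale  # stored int8 values + float16 per-row scales

def int5_dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Reconstruct float32 weights from int8 values and per-row scales."""
    return q.astype(np.float32) * scale.astype(np.float32)
```

The [-15, 15] range uses 31 of the 32 int5 levels symmetrically, which is what makes it tighter than int6 while still round-tripping through int8 storage.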

3-Seed Results

| Seed | val_bpb | Artifact |
| --- | --- | --- |
| 42 (submitted artifact) | 1.13256182 | 15.51 MiB |
| 314 | 1.13557402 | 15.60 MiB |
| 2025 | 1.13360681 | 15.59 MiB |
| Mean | 1.13391 | |
| Std | 0.00153 | |

All three seeds trail the official SOTA (#549, 1.1194) by roughly 0.013–0.016 BPB, hence the Non-record label. All artifacts are under 16 MiB.

Architecture

| Param | Value |
| --- | --- |
| Layers | 11 |
| Model dim | 512 |
| Heads | 8 |
| MLP hidden | 1536 |
| Bigram buckets | 4096 |
| Bigram embed dim | 128 |
| Vocab size | 256 |
| Tie embeddings | false |

Rule Compliance

  • Score-first TTT: tokens scored under inference_mode() before training on them
  • No val tokens used in artifact or training
  • No pre-eval adaptation
  • Submitted artifact: 15.51 MiB (under 16 MiB limit)
  • All validation artifacts under 16 MiB
  • Training time: ~37 min | Eval time: ~192s (under 600s budget)
  • 3-seed validation (seeds 42, 314, 2025)
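
The score-first ordering in the first bullet is the crux of TTT legality, and can be sketched framework-agnostically. This is an illustrative loop, not the PR's code (the PR uses `torch.inference_mode()` and AdamW on MLP params; here `model` and `adapt` are hypothetical callables, demonstrated with a toy adaptive unigram byte model):

```python
import numpy as np

def score_first_ttt(chunks, model, adapt):
    """Score-first test-time training loop (illustrative sketch).

    Rule-compliant order: each chunk is scored with the current (frozen)
    parameters FIRST, and only afterwards may the model adapt on that same
    chunk. No chunk ever influences its own score.
    """
    total_bits, total_tokens = 0.0, 0
    for chunk in chunks:
        # 1) Score under frozen weights. `model` returns the chunk's
        #    summed negative log2-likelihood in bits.
        total_bits += model(chunk)
        total_tokens += len(chunk)
        # 2) Only now adapt on the chunk.
        adapt(chunk)
    return total_bits / total_tokens  # bits per byte

# Toy demonstration: an adaptive unigram byte model with Laplace counts.
counts = np.ones(256)

def model(chunk: bytes) -> float:
    probs = counts / counts.sum()
    return float(-np.log2(probs[np.frombuffer(chunk, dtype=np.uint8)]).sum())

def adapt(chunk: bytes) -> None:
    np.add.at(counts, np.frombuffer(chunk, dtype=np.uint8), 1)

bpb = score_first_ttt([b"aaaa", b"aaaa", b"aaaa"], model, adapt)
```

In the toy run, the first chunk is scored at the uniform 8 bits/byte and later chunks get cheaper as the model adapts, so the mean lands strictly between 0 and 8 bits/byte.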

Train log, submission.json, and training script included.

…g to fit int6 under 16MB

- INT6_CLIP_PERCENTILE now reads from env (default 99.99984, wave46 uses 99.0)
- PRUNE_PCT added to 1.0677 script (was missing, wave46 uses 0.25)
- Modal harness wave46_clip_prune.py for detached runs
- Both levers push zeros into weight tensors for better zstd compression
- Base architecture: SwiGLU + U-Net + XSA4 + BigramHash(8192) = 1.0677 BPB pre-compression
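
The clip-percentile lever described above can be sketched as follows. This is a hypothetical reading of the mechanism, not the repo's code: outlier weights are clipped at a percentile read from the `INT6_CLIP_PERCENTILE` env var before quantization, so per-row scales shrink and more small weights round to exact zero for zstd:

```python
import os
import numpy as np

# Clip percentile comes from the environment, defaulting to the near-no-op
# 99.99984 mentioned above (wave46 uses an aggressive 99.0).
CLIP_PCT = float(os.environ.get("INT6_CLIP_PERCENTILE", "99.99984"))

def clip_weights(w: np.ndarray, pct: float = CLIP_PCT) -> np.ndarray:
    """Symmetrically clip weights at the given |weight| percentile."""
    lim = np.percentile(np.abs(w), pct)
    return np.clip(w, -lim, lim)
```

With a smaller max |weight|, the per-row quantization scale shrinks, so near-zero weights quantize to 0 more often, which is exactly the "push zeros into weight tensors" effect both levers share.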